HDDS-7100. Container scanner incorrectly marks containers unhealthy when DN is shutdown #4951
Hi @sodonnel, if you have time you may be interested in reviewing this one, since you looked at this issue originally.
sumitagrawl
left a comment
@errose28 LGTM+1
adoroszlai
left a comment
Thanks @errose28 for the patch, LGTM.
* master: (79 commits)
  HDDS-8914. Datanode may fail to start due to duplicate VolumeInfoMetrics (apache#4966)
  HDDS-8921. Add support for EC in Freon SCM block generator (apache#4982)
  HDDS-8927. Metadata scanner should not scan unhealthy containers. (apache#4976)
  HDDS-8929. Avoid list allocation for pipeline search (apache#4980)
  HDDS-8778. Support recursive volume delete using Ozone sh command. (apache#4842)
  HDDS-8885. Quota repair count enable quota feature for old bucket/volume. (apache#4941)
  HDDS-8771. Refactor volume level tmp directory for generic usage. (apache#4838)
  HDDS-8922. Random EC read pipeline ID causes XceiverClient cache churn (apache#4971)
  HDDS-8586 Recon. - API for Count of deletePending keys and amount of data mapped to such keys. (apache#4923)
  HDDS-8908. Intermittent failure in TestBlockDeletion#testBlockDeletion (apache#4958)
  HDDS-8910. Replace LockManager with striped lock in ContainerStateManager (apache#4962)
  HDDS-8917. Move protobuf conversion out of the lock in PipelineStateManagerImpl (apache#4965)
  HDDS-8825. Use apache/hadoop 3.3.5 docker image (apache#4963)
  HDDS-8906. Avoid stream when getting in-service healthy nodes (apache#4960)
  HDDS-8907. Store volume count when storage report is updated (apache#4957)
  HDDS-8905. PipelineManager metrics should not be synchronized (apache#4959)
  HDDS-8553. Improve scanner integration tests. (apache#4936)
  HDDS-8854. Avoid unnecessary DatanodeDetails creation for NodeStateManager lookup (apache#4925)
  HDDS-8315. [Snapshot] Added unit tests for SnapshotDiffManager (apache#4716)
  HDDS-7968. [Snapshot] Improve KeyDeletingService to reclaim eligible key blocks in snapshot's deletedTable (apache#4935)
  ...
* master:
  HDDS-8555. [Snapshot] When snapshot feature is disabled, block OM startup if there are still snapshots in the system (apache#4994)
  HDDS-8782. Improve Volume Scanner Health checks. (apache#4867)
  HDDS-8447. Datanodes should not process container deletes for failed volumes. (apache#4901)
  HDDS-5869. Added support for stream on S3Gateway write path (apache#4970)
  HDDS-8859. [Snapshot] Return failure message to client for a failed snapshot diff jobs (apache#4993)
  HDDS-8939. [Snapshot] isBlockLocationSame check should be skipped if object is not OmKeyInfo. (apache#4991)
  HDDS-8923. Expose XceiverClient cache stats as metrics (apache#4979)
  HDDS-8913. ContainerManagerImpl: reduce processing while locked (apache#4967)
  HDDS-8935. [Snapshot] Fallback to full diff if getDetlaFiles from compaction DAG fails (apache#4986)
  HDDS-8911. Update Hadoop to 3.3.6 (apache#4985)
  HDDS-8931. Allow EC PipelineChoosingPolicy to be defined separately from Ratis (apache#4983)
  HDDS-8895. Support dynamic change of ozone.readonly.administrators in SCM (apache#4977)
  HDDS-6814. Make OM service ID optional for `ozone s3` commands if only one is defined in config (apache#4953)
  HDDS-8925. BaseFreonGenerator may not complete if last attempts fail (apache#4975)
  HDDS-7100. Container scanner incorrectly marks containers unhealthy when DN is shutdown (apache#4951)
  HDDS-8919. Allow EC pipelines to be created and then added to PipelineManager in two steps (apache#4968)
  HDDS-8901. Enable mTLS for InterSCMGrpcProtocol. (apache#4964)

Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/interfaces/Container.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/ContainerTestUtils.java
What changes were proposed in this pull request?
Ensure container scanners shut down cleanly without marking containers unhealthy if ongoing scans are interrupted due to shutdown.
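As a rough illustration of the idea (not the patch itself), a scan can classify interruption-related exceptions separately from real I/O failures, so that a shutdown in the middle of a scan does not count as corruption. The names `ScanInterruptSketch`, `ScanResult`, `ContainerCheck`, and `scan` are hypothetical and are not part of Ozone; the interrupt-flag check on a plain `IOException` is also an assumption about how such handling could look.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.nio.channels.ClosedByInterruptException;

public class ScanInterruptSketch {
  /** Hypothetical scan outcome; not an Ozone type. */
  public enum ScanResult { HEALTHY, UNHEALTHY, INTERRUPTED }

  /** Hypothetical stand-in for the work of verifying one container. */
  public interface ContainerCheck {
    void run() throws IOException, InterruptedException;
  }

  public static ScanResult scan(ContainerCheck check) {
    try {
      check.run();
      return ScanResult.HEALTHY;
    } catch (InterruptedException | InterruptedIOException
        | ClosedByInterruptException e) {
      // The scan was cut short (for example by shutdown), so nothing was
      // learned about the container. Restore the interrupt flag for callers.
      Thread.currentThread().interrupt();
      return ScanResult.INTERRUPTED;
    } catch (IOException e) {
      // Assumption: an I/O error seen while the thread is already flagged as
      // interrupted is attributed to the interruption, not to bad data.
      if (Thread.currentThread().isInterrupted()) {
        return ScanResult.INTERRUPTED;
      }
      return ScanResult.UNHEALTHY;
    }
  }

  public static void main(String[] args) {
    System.out.println(scan(() -> { }));
    System.out.println(scan(() -> { throw new IOException("checksum mismatch"); }));
    System.out.println(scan(() -> { throw new InterruptedIOException("shutdown"); }));
    Thread.interrupted(); // clear the flag that scan() restored
  }
}
```

Only the `UNHEALTHY` result would lead to marking a container unhealthy in this sketch; an `INTERRUPTED` result leaves the container state untouched so that a later scan can decide.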
MutableVolumeSet. The datanode already has a shutdown hook set up in HddsDatanodeService#call that does a superset of this hook's operations and does them in the correct order.
MutableVolumeSet would close the RocksDB before stopping the scanners, causing an error in the running scanners.
InterruptedIOException and ClosedByInterruptException.
InterruptedException if it is set.
What is the link to the Apache JIRA
HDDS-7100
How was this patch tested?
docker compose exec om ozone freon ockg --size=10000000 -n100 --type=RATIS -rTHREE